Conversation

ballista01
Member

This PR introduces status reporting for the EtcdCluster custom resource (v1alpha1), as proposed in issue #135. It allows users to observe the actual state of the managed etcd cluster directly from the EtcdCluster object. The original reconciliation logic is preserved.

Changes Implemented:

  1. API Definition (api/v1alpha1/etcdcluster_types.go):
    • Defined the EtcdClusterStatus struct.
    • Added the following fields to EtcdClusterStatus:
      • readyReplicas (int32)
      • members (int32)
      • phase (string)
      • conditions ([]metav1.Condition) and currentVersion (string) (TODO: reporting logic)
    • Enabled the /status subresource via the +kubebuilder:subresource:status marker.
  2. Generated Code:
    • Updated api/v1alpha1/zz_generated.deepcopy.go using make generate.
    • Updated the CRD manifest config/crd/bases/operator.etcd.io_etcdclusters.yaml using make generate / controller-gen crd.
  3. Controller Logic (internal/controller/etcdcluster_controller.go):
    • Modified the Reconcile method to gather status information:
      • Reads status.ReadyReplicas from the managed StatefulSet.
      • Reads etcd cluster member count (status.members) from the healthCheck results.
      • Sets status.phase based on the observed state during reconciliation (e.g., Creating, Scaling, Running, Degraded, Failed).
      • Logic for conditions and CurrentVersion is still to be implemented.
    • Implemented a helper function, updateStatusIfNeeded, for clarity and consistency when patching the status.
    • Used a defer statement to call updateStatusIfNeeded before Reconcile returns, so the latest calculated status is always patched via r.Status().Patch(ctx, ..., client.MergeFrom(oldStatus)); see the sketch after this list.
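
A minimal sketch of the defer-based status patching described above, assuming the kubebuilder-scaffolded reconciler; the import path, receiver fields, and exact signature of updateStatusIfNeeded are assumptions, not the final implementation:

package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	// Assumed import path and alias for the v1alpha1 API package.
	ecv1alpha1 "go.etcd.io/etcd-operator/api/v1alpha1"
)

// EtcdClusterReconciler mirrors the kubebuilder-scaffolded reconciler (other fields elided).
type EtcdClusterReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *EtcdClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	etcdCluster := &ecv1alpha1.EtcdCluster{}
	if err := r.Get(ctx, req.NamespacedName, etcdCluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Snapshot the object before reconciliation mutates Status; it becomes the patch base.
	oldStatus := etcdCluster.DeepCopy()

	// Always attempt to patch the status with whatever was computed, even on early returns.
	defer func() {
		if err := r.updateStatusIfNeeded(ctx, etcdCluster, oldStatus); err != nil {
			logger.Error(err, "failed to patch EtcdCluster status")
		}
	}()

	// ... existing reconciliation logic fills in readyReplicas, members, phase ...

	return ctrl.Result{}, nil
}

// updateStatusIfNeeded patches the /status subresource only when the newly
// computed status differs from the previously observed one.
func (r *EtcdClusterReconciler) updateStatusIfNeeded(ctx context.Context, cluster, oldStatus *ecv1alpha1.EtcdCluster) error {
	if equality.Semantic.DeepEqual(oldStatus.Status, cluster.Status) {
		return nil // nothing changed; skip the no-op PATCH
	}
	return r.Status().Patch(ctx, cluster, client.MergeFrom(oldStatus))
}

Skipping unchanged statuses avoids an extra API call on every reconcile, and client.MergeFrom keeps the patch limited to the fields that actually changed.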

How to Test:

  1. Build and deploy the operator image from this PR branch: make docker-build docker-push IMG=<your-image> and make deploy IMG=<your-image>.
  2. Create an EtcdCluster instance: kubectl apply -f config/samples/operator_v1alpha1_etcdcluster.yaml (ensure spec.size > 0 and spec.version is set).
  3. Monitor the status updates: kubectl get etcdcluster <name> -o yaml --watch.
  4. Verify that status.readyReplicas, status.members, and status.phase are populated and change according to the cluster's lifecycle.
  5. Check operator logs for status update messages.

Future Considerations:

  • Implement population of status.currentVersion.
  • Implement handling and reporting using status.conditions.
  • Refine the status.phase logic for more accurate representation of intermediate and error states.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ballista01
Once this PR has been reviewed and has the lgtm label, please assign jmhbnz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @ballista01. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

// attempt processing again later. This could have been caused by a
// temporary network failure, or any other transient reason.
logger.Error(err, "Failed to get StatefulSet. Requesting requeue")
etcdCluster.Status.Phase = "Failed"
Contributor

Q: In a scenario where the StatefulSet already exists but the getStatefulSet call fails, having etcdCluster.Status.Phase set to Failed might seem misleading, wdyt?
Maybe Unknown makes more sense?

Member Author

I agree that Failed is misleading. However, Unknown is ambiguous to the user. How about Degraded? Either way, the user can find detailed info in the conditions field; etcdCluster.Status.Phase is now a high-level summary based on conditions.

Contributor

Sure, Degraded is better in this case.
I was wondering if we should also keep a separate condition, Unhealthy, to distinguish between Degraded (when a Get/Create call fails) and Unhealthy (when the health check fails).

Comment on lines 66 to 79
// ReadyReplicas is the number of pods targeted by this EtcdCluster with a Ready condition.
ReadyReplicas int32 `json:"readyReplicas,omitempty"`
// Members is the number of etcd members in the cluster reported by etcd API.
Members int32 `json:"members,omitempty"`
// CurrentVersion is the version of the etcd cluster.
CurrentVersion string `json:"currentVersion,omitempty"`
// Phase indicates the state of the EtcdCluster.
Phase string `json:"phase,omitempty"`
// Conditions represent the latest available observations of the EtcdCluster's state.
// +optional
// +patchMergeKey=type
// +patchStrategy=merge
// +listType=atomic
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
Member

I think we need to get consensus on the field list first. I think we need to record each member's status. cc @ivanvc @jmhbnz @jberkus @neolit123 @hakman @justinsb @ArkaSaha30

My draft thought:

- ReadyReplicas (similar to sts's ReadyReplicas)
- CurrentReplicas (similar to sts's CurrentReplicas)
- MemberCount (how many members in the etcd cluster)
- Members (a slice)
  - name: etcd-1
    healthy: true
    Alarm/Error: nil
    version: 3.5.21
    storageVersion: 3.5.0
    downgradeInfo: nil
    dbSize: 5000000
    dbSizeInUse: 1000000
  - name: etcd-2
    healthy: true
    Alarm/Error: nil
    version: 3.5.21
    storageVersion: 3.5.0
    downgradeInfo: nil
    dbSize: 5000000
    dbSizeInUse: 1000000
  - name: etcd-3
    healthy: true
    Alarm/Error: nil
    version: 3.5.21
    storageVersion: 3.5.0
    downgradeInfo: nil
    dbSize: 5000000
    dbSizeInUse: 1000000
- Conditions []metav1.Condition
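
Purely as an illustration, the draft above could map to Go API types along these lines; every name, type, and JSON tag below is a placeholder assumption for discussion, not an agreed design:

// Hypothetical API types sketching the per-member status draft above.
// All names, types, and JSON tags are assumptions for discussion only.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// EtcdMemberStatus would report the observed state of a single etcd member.
type EtcdMemberStatus struct {
	Name           string `json:"name"`
	Healthy        bool   `json:"healthy"`
	Version        string `json:"version,omitempty"`
	StorageVersion string `json:"storageVersion,omitempty"`
	// Alarm mirrors any active etcd alarm (e.g. NOSPACE); empty when none.
	Alarm string `json:"alarm,omitempty"`
	// DBSize and DBSizeInUse are reported in bytes; downgradeInfo omitted for brevity.
	DBSize      int64 `json:"dbSize,omitempty"`
	DBSizeInUse int64 `json:"dbSizeInUse,omitempty"`
}

// EtcdClusterStatus would aggregate cluster- and member-level observations.
type EtcdClusterStatus struct {
	ReadyReplicas   int32              `json:"readyReplicas,omitempty"`
	CurrentReplicas int32              `json:"currentReplicas,omitempty"`
	MemberCount     int32              `json:"memberCount,omitempty"`
	Members         []EtcdMemberStatus `json:"members,omitempty"`
	Conditions      []metav1.Condition `json:"conditions,omitempty"`
}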

Member Author

I was working in this direction last weekend. Will post an updated draft soon.

Member Author

I like the members slice idea. It provides more fine-grained info. Would love to work on that if we agree on it.

@neolit123 neolit123 May 13, 2025

I don't mind if we push for an API that covers a number of interesting fields in the initial versions of the operator, but I think we should probably draft a Google doc with a proposal, and for the next version just push out a very simple spec that has just the etcd version and maybe health: true/false. Note: even the health status needs some discussion of what it entails.

@ballista01 are you willing to draft such an API doc where we can rally discussion?

Member Author

@neolit123 Sure! I'll post the link in slack.

Contributor

What's the argument for the
dbSize: 5000000 dbSizeInUse: 1000000

... fields? Is there some way that cluster operators would use that info that requires it to be available via the kubernetes API, instead of just via etcdctl?

Member

dbSize: 5000000 dbSizeInUse: 1000000

etcd-operator can only get such info from the etcd cluster via the etcd SDK API (similar to etcdctl). One possible use case is automatic compaction & defragmentation. But as mentioned, we can get such info from the etcd cluster directly, so in general, any info that we can get from the etcd cluster directly might not need to go into EtcdCluster.Status.

As we discussed in the community meeting, one action item for now:

  • Investigate best practices for populating the Status: for example, how K8s populates & uses the StatefulSet's Status, and how other well-known operators populate & use their Status.

In the long-term, if we add any field/info into EtcdCluster.Status, we should have a clear goal first.
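
For reference, here is a minimal sketch of how the operator could read that data via the etcd Go client's Maintenance API; the endpoint value below is a placeholder. This is the same information etcdctl endpoint status prints:

// Minimal sketch: fetch per-endpoint DB size via the etcd Go client.
// The endpoint URL is a placeholder, not an actual service name from this repo.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-0.etcd-headless:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Status returns the same data as `etcdctl endpoint status`,
	// including Version, DbSize, and DbSizeInUse.
	resp, err := cli.Status(ctx, "http://etcd-0.etcd-headless:2379")
	if err != nil {
		panic(err)
	}
	fmt.Printf("version=%s dbSize=%d dbSizeInUse=%d\n", resp.Version, resp.DbSize, resp.DbSizeInUse)
}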

Contributor

Yeah, I'm basically looking for use cases of "someone is going to use other Kubernetes components to automate this" which is a good justification for it being available via the Kubernetes API.

Member Author

@ballista01 ballista01 Jul 22, 2025

My suggestion regarding the status field:

status:
  observedGeneration: 5                    # NEW: Top-level field indicating the latest .metadata.generation processed by the controller
  currentReplicas: 3                       # Unchanged: Matches the StatefulSet replica count
  currentVersion: 3.6.0-rc.3               # Unchanged: Represents the running etcd version of the cluster
  leaderId: 975eab669fe7b560               # Unchanged: Leader member ID for status observability
  memberCount: 3                           # Unchanged: Total number of etcd members
  readyReplicas: 3

  # lastDefragTime could be optionally added if controller manages defragmentation
  # lastDefragTime: "2025-07-20T12:00:00Z"   # OPTIONAL: Cluster-level defragmentation completion time; not included in this PR

  members:
    - id: 22f4b6fa7639917a
      name: test-cluster-2
      version: 3.6.0-rc.3
      isHealthy: true
      isLearner: false                     # NEW: Added to support learner promotion logic and troubleshooting
      # REMOVED: dbSize/dbSizeInUse → considered too dynamic; suggested to be exposed via /metrics
      # clientURL and peerURL omitted here for minimal diagnostic view, can be re-added under a feature gate

    - id: 52117ab448fed65d
      name: test-cluster-1
      version: 3.6.0-rc.3
      isHealthy: true
      isLearner: false

    - id: 4f8c9e671cbd2233
      name: test-cluster-0
      version: 3.6.0-rc.3
      isHealthy: true
      isLearner: false

  conditions:
    - type: Available
      status: "True"
      reason: ClusterReady
      message: Cluster is fully available and healthy
      observedGeneration: 5
      lastTransitionTime: "2025-07-02T18:52:06Z"
      # Unchanged: Matches K8s convention for status.conditions

    - type: Progressing
      status: "False"
      reason: ReconcileSuccess
      message: Cluster reconciled to desired state
      observedGeneration: 5
      lastTransitionTime: "2025-07-02T18:52:06Z"
      # Unchanged: Indicates that spec and observed state are aligned

    - type: Degraded
      status: "False"
      reason: ClusterReady
      message: Cluster is healthy
      observedGeneration: 5
      lastTransitionTime: "2025-07-02T18:50:57Z"
      # Unchanged: Used to signal partial failures if any member is unhealthy

  # Note: dbSize and dbSizeInUse fields are REMOVED in this revision.
  # Rationale: These are volatile metrics not suitable for status fields;
  # better exposed via Prometheus-compatible /metrics endpoint for observability.
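
As a sketch of how the conditions above could be maintained consistently, the standard apimachinery helper can be used; the wrapper name setCondition and the example call site are illustrative assumptions, not the final helper in this PR:

// Illustrative condition helper built on k8s.io/apimachinery; the wrapper name
// and the example call site are assumptions, not the final code in this PR.
package controller

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setCondition upserts a condition of the given type; SetStatusCondition only
// bumps lastTransitionTime when the condition's status actually flips.
func setCondition(conditions *[]metav1.Condition, condType string, status metav1.ConditionStatus,
	reason, message string, observedGeneration int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               condType,
		Status:             status,
		Reason:             reason,
		Message:            message,
		ObservedGeneration: observedGeneration,
	})
}

// Example call site inside Reconcile (cluster is the *EtcdCluster being reconciled):
//
//	setCondition(&cluster.Status.Conditions, "Available", metav1.ConditionTrue,
//	    "ClusterReady", "Cluster is fully available and healthy", cluster.Generation)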

Contributor

This is looking really good. However, I'm opening a discussion so that we can hammer out the final field list: issue #182.

@jberkus
Contributor

jberkus commented May 19, 2025

/ok-to-test

@ballista01
Member Author

Hi, I know this PR is getting quite large, so I wanted to share my plan upfront. My main goal here is to use this as a tracker PR to run the full CI pipeline and give everyone a complete picture of the proposed changes.

Once CI passes and we're happy with the overall direction, I'll split this up into smaller sub-PRs that will be much easier to review and merge.

I also noticed that after the foundational work, the current monolithic Reconcile function makes it tricky to write unit tests for the new status logic.

So, my next major step will likely be to refactor the Reconcile loop into smaller testable helper methods. I figure this might be a common challenge, so if there are already any existing issues, design docs, or discussions about refactoring the reconciler, please let me know! I'd love to build on any existing work.

I may start some exploratory work on the refactoring, but the final implementation in the smaller PRs will absolutely follow whatever we decide on as a community.

@ballista01 ballista01 force-pushed the etcd-cluster-status branch 2 times, most recently from c3e3b33 to 1ea2d50 on September 29, 2025 at 09:56
@ballista01
Member Author

ballista01 commented Sep 29, 2025

It seems that with the new status & condition reporting feature, the cyclomatic complexity of Reconcile is too high. I'll try to get #180 rebased and merged first to reduce the complexity of Reconcile.

…s integration, and etcd health improvements

Signed-off-by: Wenxue Zhao <[email protected]>
@ballista01
Member Author

Still working on reducing the complexity of status & condition reporting code in internal/controller/etcdcluster_controller.go.

- add condition helpers for consistent updates
- reconcile controller to patch status and observedGeneration each loop

Signed-off-by: Wenxue Zhao <[email protected]>
@k8s-ci-robot

@ballista01: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-etcd-operator-verify
Commit: 9de5dcf
Required: true
Rerun command: /test pull-etcd-operator-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
